A Study on Multi-word Extraction from Chinese Documents

Authors

  • Wen Zhang
  • Taketoshi Yoshida
  • Xijin Tang
Abstract

A multi-word is a sequence of two or more consecutive words that carries the contextual semantics of its constituent words. Multi-words attract considerable attention in statistical linguistics and have extensive applications in text mining. In this paper, we carry out a series of studies on multi-word extraction from Chinese documents. First, we propose a new statistical method, augmented mutual information (AMI), for measuring the dependency between words. Experimental results show that the AMI method achieves an average recall of 80% with a precision of about 20%-30%. Second, we attempt to use the variance of the occurrence frequencies of the individual words in a multi-word candidate to deal with the rare-occurrence problem, but the experimental results do not validate the effectiveness of variance. Third, we develop a syntactic method based on the lexical regularities of Chinese multi-words to extract multi-words from Chinese documents. Experimental results show that this syntactic method achieves a higher average precision (0.5521) than the AMI method, but its recall is not comparable. Finally, we discuss a possible breakthrough from combining statistical and syntactic methods.
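
As a rough illustration of the statistical idea in the abstract, the sketch below scores adjacent word pairs with standard pointwise mutual information (PMI) and computes the variance of the component words' frequencies. It is not the AMI method, whose formula is not given on this page; PMI is a plain stand-in. The function names `bigram_pmi` and `frequency_variance` and the toy pre-segmented token list are assumptions made for this example only.

```python
import math
from collections import Counter
from statistics import pvariance


def bigram_pmi(tokens):
    """Score every adjacent word pair with standard PMI (log base 2).

    This is ordinary pointwise mutual information, used here as a
    stand-in; it is NOT the paper's augmented mutual information.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def frequency_variance(candidate, unigrams):
    """Population variance of the component words' occurrence counts."""
    return pvariance([unigrams[w] for w in candidate])


# Hypothetical toy input: Chinese text already segmented into words.
tokens = ["文本", "挖掘", "方法", "文本", "挖掘", "应用", "方法", "应用"]
scores = bigram_pmi(tokens)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
print(frequency_variance(best, Counter(tokens)))
```

A candidate whose PMI exceeds a chosen threshold would be kept as a multi-word. The threshold, and how the variance would be combined with the dependency score, are left open here because the abstract does not specify them.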

Similar resources

Distribution of Multi-Words in Chinese and English Documents

As a hybrid of the N-gram in natural language processing and the collocation in statistical linguistics, the multi-word is becoming a hot topic in the areas of text mining and information retrieval. In this paper, a study of the distribution of multi-words is carried out to explore a theoretical basis for a probabilistic term-weighting scheme. Specifically, the Poisson distribution, zero-inflated binomial dis...

Chinese Language IR based on Term Extraction

In this paper, we’ll describe the core technology and modules we use in LIT (formerly KRDL)’s Chinese Language Information Retrieval System. The system mainly includes automatic term extraction from Chinese documents, query analysis based on the terms and finally measurement of the association between queries and documents. Compared with other methods, we try to use automatically acquired terms...

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...

Extracting Chinese Multi-Word Units from Large-Scale Balanced Corpus

Automatic Multi-word Units Extraction is an important issue in Natural Language Processing. This paper has proposed a new statistical method based on a large-scale balanced corpus to extract multi-word units. We have used two improved traditional parameters: mutual information and log-likelihood ratio, and have increased the precision for the top 10,000 words extracted through the method to 80....

Space characters in Chinese semi-structured texts

Space characters can play an important role in disambiguating text. However, few, if any, Chinese information extraction systems make full use of them, even though treating space characters seems necessary, especially when extracting information from semi-structured documents. This investigation aims to address the importance of space characters in Chinese informa...

Publication date: 2008